
Search for: All records

Creators/Authors contains: "Nwogu, Ifeoma"


  1. Free, publicly-accessible full text available December 4, 2025
  2. Reconstructing 3D faces with facial geometry from single images has allowed for major advances in animation, generative models, and virtual reality. However, this ability to represent faces with their 3D features has not been as fully explored by the facial expression inference (FEI) community. This study therefore aims to investigate the impact of integrating such 3D representations into the FEI task, specifically for facial expression classification and face-based valence-arousal (VA) estimation. To accomplish this, we first assess the performance of two 3D face representations (both based on the 3D morphable model FLAME) on the FEI tasks. We further explore two fusion architectures, intermediate fusion and late fusion, for integrating the 3D face representations with existing 2D inference frameworks. To evaluate the proposed architectures, we extract the corresponding 3D representations and perform extensive tests on the AffectNet and RAF-DB datasets. Our experimental results demonstrate that the proposed method outperforms the state of the art on the AffectNet VA estimation and RAF-DB classification tasks. Moreover, our method can complement other existing methods to boost performance on many emotion inference tasks.
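     To make the late-fusion idea above concrete, here is a minimal PyTorch sketch of decision-level fusion between a 2D image branch and a branch over FLAME coefficients. The module name, feature dimensions, and learnable weighting are illustrative assumptions, not the paper's exact architecture.

     # Minimal late-fusion sketch (PyTorch): combine predictions from a 2D image
     # branch and a FLAME-parameter branch at the decision level. Dimensions,
     # branch architectures, and the weighting scheme are illustrative assumptions.
     import torch
     import torch.nn as nn

     class LateFusionFEI(nn.Module):
         def __init__(self, num_classes=8, flame_dim=156, img_feat_dim=512):
             super().__init__()
             # 2D branch: stand-in for any pretrained image backbone's pooled features.
             self.img_head = nn.Linear(img_feat_dim, num_classes)
             # 3D branch: an MLP over FLAME shape/expression/pose coefficients.
             self.flame_head = nn.Sequential(
                 nn.Linear(flame_dim, 256), nn.ReLU(),
                 nn.Linear(256, num_classes),
             )
             # Learnable scalar balancing the two branches at the decision level.
             self.alpha = nn.Parameter(torch.tensor(0.5))

         def forward(self, img_feats, flame_params):
             logits_2d = self.img_head(img_feats)
             logits_3d = self.flame_head(flame_params)
             w = torch.sigmoid(self.alpha)
             return w * logits_2d + (1.0 - w) * logits_3d

     # Example: batch of 4 faces.
     model = LateFusionFEI()
     out = model(torch.randn(4, 512), torch.randn(4, 156))
     print(out.shape)  # torch.Size([4, 8])
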
  3. This work aims to generate natural and diverse group motions of multiple humans from textual descriptions. While single-person text-to-motion generation is extensively studied, it remains challenging to synthesize motions for more than one or two subjects from in-the-wild prompts, mainly due to the lack of available datasets. In this work, we curate human pose and motion datasets by estimating pose information from large-scale image and video datasets. Our models use a transformer-based diffusion framework that accommodates multiple datasets with any number of subjects or frames. Experiments explore both the generation of multi-person static poses and the generation of multi-person motion sequences. To our knowledge, our method is the first to generate multi-subject motion sequences with high diversity and fidelity from a large variety of textual prompts.
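     As a rough illustration of the transformer-based diffusion framework described above, the sketch below shows one DDPM-style training step for text-conditioned motion, where a transformer predicts the noise added to a flattened (persons x frames) sequence of pose vectors. The token layout, dimensions, and the assumed 512-dimensional text embedding are placeholders, not the paper's design.

     # Hedged sketch of one DDPM-style training step for text-conditioned
     # multi-person motion. All dimensions and the text encoder are assumptions.
     import torch
     import torch.nn as nn

     class MotionDenoiser(nn.Module):
         def __init__(self, pose_dim=72, d_model=256, n_layers=4):
             super().__init__()
             self.in_proj = nn.Linear(pose_dim, d_model)
             self.t_embed = nn.Embedding(1000, d_model)      # diffusion timestep
             self.txt_proj = nn.Linear(512, d_model)         # e.g. a CLIP text embedding
             layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
             self.encoder = nn.TransformerEncoder(layer, n_layers)
             self.out_proj = nn.Linear(d_model, pose_dim)

         def forward(self, noisy_motion, t, text_emb):
             # noisy_motion: (B, persons*frames, pose_dim); persons and frames can vary.
             h = self.in_proj(noisy_motion) + self.t_embed(t)[:, None, :]
             h = torch.cat([self.txt_proj(text_emb)[:, None, :], h], dim=1)  # prepend text token
             h = self.encoder(h)
             return self.out_proj(h[:, 1:, :])               # drop the text token

     def training_step(model, x0, text_emb, alphas_cumprod):
         t = torch.randint(0, 1000, (x0.size(0),))
         noise = torch.randn_like(x0)
         a = alphas_cumprod[t].view(-1, 1, 1)
         xt = a.sqrt() * x0 + (1 - a).sqrt() * noise         # forward diffusion
         pred = model(xt, t, text_emb)
         return nn.functional.mse_loss(pred, noise)
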
  4. We demonstrate a procedure for anonymizing infant subjects in videos such that salient behavioral information is retained. The method also creates a new identity that is temporally consistent across video frames. We present an overview of this anonymization process, which involves moving through the latent space of a generative model with an infant-specific latent space traversal technique. We apply the technique to videos of infants, a historically difficult source of data, and compare against other state-of-the-art anonymization systems. Metrics demonstrate an improved ability to retain the emotional content of videos during anonymization, even under extreme emotions or poses, while maintaining a consistent identity throughout.
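     A minimal sketch of the latent-space-traversal idea, assuming a generative model whose inversion yields one latent vector per frame: the shared identity component is moved toward a synthetic donor identity while per-frame residuals are kept, so behavior stays temporally consistent. The identity/residual split and the traversal rule are illustrative assumptions, not the paper's exact technique.

     # Identity-swapping latent traversal sketch; assumes a StyleGAN-like generator
     # whose inversion gives one latent per video frame.
     import torch

     def anonymize_latents(frame_latents, donor_latent, strength=1.0):
         """frame_latents: (T, D) inverted latents for one infant video.
         donor_latent:  (D,) latent of a synthetic replacement identity."""
         identity = frame_latents.mean(dim=0)        # shared identity estimate
         residuals = frame_latents - identity        # per-frame expression/pose
         # Move the shared component toward the donor; keep residuals so that
         # frame-to-frame behavior (and hence emotional content) is preserved.
         new_identity = identity + strength * (donor_latent - identity)
         return new_identity + residuals             # (T, D), temporally consistent

     # Usage (shapes only): anon = anonymize_latents(torch.randn(120, 512), torch.randn(512))
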
  5. Achieving expressive 3D motion reconstruction and automatic generation for isolated sign words can be challenging, due to the lack of real-world 3D sign-word data, the complex nuances of signing motions, and the cross-modal understanding of sign language semantics. To address these challenges, we introduce SignAvatar, a framework capable of both word-level sign language reconstruction and generation. SignAvatar employs a transformer-based conditional variational autoencoder architecture, effectively establishing relationships across different semantic modalities. Additionally, this approach incorporates a curriculum learning strategy to enhance the model's robustness and generalization, resulting in more realistic motions. Furthermore, we contribute the ASL3DWord dataset, composed of 3D joint rotation data for the body, hands, and face, for unique sign words. We demonstrate the effectiveness of SignAvatar through extensive experiments, showcasing its superior reconstruction and automatic generation capabilities. The code and dataset are available on the project page.
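     The following is a hedged PyTorch sketch of a word-conditioned transformer CVAE of the kind the abstract describes: a word embedding conditions both the encoder and the decoder, and a latent code is sampled via the reparameterization trick. Dimensions, pooling, and the zero-initialized decoder queries are simplifications; the actual SignAvatar architecture may differ.

     # Word-conditioned transformer CVAE sketch for joint-rotation sequences.
     # Vocabulary size, rotation dimensionality, and pooling are assumptions.
     import torch
     import torch.nn as nn

     class SignCVAE(nn.Module):
         def __init__(self, rot_dim=6 * 55, vocab=1000, d=256, z_dim=64):
             super().__init__()
             self.word_emb = nn.Embedding(vocab, d)
             self.in_proj = nn.Linear(rot_dim, d)
             enc_layer = nn.TransformerEncoderLayer(d, 4, batch_first=True)
             self.encoder = nn.TransformerEncoder(enc_layer, 3)
             self.to_mu = nn.Linear(d, z_dim)
             self.to_logvar = nn.Linear(d, z_dim)
             self.z_proj = nn.Linear(z_dim, d)
             dec_layer = nn.TransformerDecoderLayer(d, 4, batch_first=True)
             self.decoder = nn.TransformerDecoder(dec_layer, 3)
             self.out_proj = nn.Linear(d, rot_dim)

         def forward(self, motion, word_ids, n_frames=None):
             # motion: (B, T, rot_dim); word_ids: (B,)
             cond = self.word_emb(word_ids)[:, None, :]                  # (B, 1, d)
             h = self.encoder(torch.cat([cond, self.in_proj(motion)], 1))
             mu, logvar = self.to_mu(h[:, 0]), self.to_logvar(h[:, 0])   # pool on cond token
             z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()        # reparameterize
             T = n_frames or motion.size(1)
             queries = torch.zeros(motion.size(0), T, cond.size(-1))     # learned queries in practice
             memory = torch.cat([cond, self.z_proj(z)[:, None, :]], 1)
             recon = self.out_proj(self.decoder(queries, memory))
             return recon, mu, logvar
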
  6. Sign language is a complex visual language, and automatic interpretation of sign language can facilitate communication involving deaf individuals. As one of the essential components of sign language, fingerspelling connects natural spoken languages to sign language and expands the scale of the sign language vocabulary. In practice, it is challenging to analyze fingerspelling alphabets due to their signing speed and small motion range. The use of synthetic data has the potential to further improve fingerspelling alphabet analysis at scale. In this paper, we evaluate how different video-based human representations perform in a framework for Alphabet Generation for American Sign Language (ASL). We tested three mainstream video-based human representations: the two-stream inflated 3D ConvNet, 3D landmarks of body joints, and rotation matrices of body joints. We also evaluated the effect of different skeleton graphs and selected body joints. The generation process for ASL fingerspelling uses a transformer-based Conditional Variational Autoencoder. To train the model, we collected ASL alphabet signing videos from 17 signers with dynamic alphabet signing. The generated alphabets were evaluated using automatic quality metrics such as FID, and we also considered supervised metrics by recognizing the generated entries with Spatio-Temporal Graph Convolutional Networks. Our experiments show that using the rotation matrices of the upper-body joints and the signing hand gives the best results for generating ASL alphabet signing. Going forward, our goal is to produce articulated fingerspelling words by combining the individual alphabets learned in this work.
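     To illustrate the best-performing representation reported above, the sketch below flattens per-joint rotation matrices for a selected set of upper-body and signing-hand joints into per-frame feature vectors. The joint indices are hypothetical placeholders, not the study's actual skeleton definition.

     # Rotation-matrix pose representation sketch; joint index lists are hypothetical.
     import numpy as np

     UPPER_BODY = [0, 3, 6, 9, 12, 13, 14, 15, 16, 17]   # hypothetical joint indices
     RIGHT_HAND = list(range(25, 40))                     # hypothetical hand joint indices
     SELECTED = UPPER_BODY + RIGHT_HAND

     def frame_features(rotmats):
         """rotmats: (J, 3, 3) per-joint rotation matrices for one frame."""
         sel = rotmats[SELECTED]            # keep only the joints that matter for signing
         return sel.reshape(-1)             # (len(SELECTED) * 9,) flat feature vector

     def sequence_features(rotmat_seq):
         """rotmat_seq: (T, J, 3, 3) -> (T, len(SELECTED) * 9)."""
         return np.stack([frame_features(f) for f in rotmat_seq])

     # Example: 30 frames, 52 joints.
     feats = sequence_features(np.tile(np.eye(3), (30, 52, 1, 1)))
     print(feats.shape)  # (30, 225)
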
  7. Sign words are the building blocks of any sign language. In this work, we present wSignGen, a word-conditioned 3D American Sign Language (ASL) generation model dedicated to synthesizing realistic and grammatically accurate motion sequences for sign words. Our approach leverages a transformer-based diffusion model, trained on a curated dataset of 3D motion meshes from word-level ASL videos. By integrating CLIP, wSignGen offers two advantages: image-based generation, which is particularly useful for children learning sign language but not yet able to read, and the ability to generalize to unseen synonyms. Experiments demonstrate that wSignGen significantly outperforms the baseline model in the task of sign word generation. Moreover, human evaluation experiments show that wSignGen can generate high-quality, grammatically correct ASL signs effectively conveyed through 3D avatars. 
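     A small sketch of how CLIP conditioning enables both image-based prompts and synonym generalization: word prompts and images are encoded into CLIP's shared embedding space, and either vector can condition the motion generator. This assumes the open-source openai/CLIP package is installed; the model variant (ViT-B/32) and prompt template are assumptions, not necessarily what wSignGen uses.

     # CLIP-based conditioning sketch: text and image prompts map to the same space.
     import clip
     import torch
     from PIL import Image

     device = "cuda" if torch.cuda.is_available() else "cpu"
     model, preprocess = clip.load("ViT-B/32", device=device)

     def condition_from_word(word):
         tokens = clip.tokenize([f"a photo of {word}"]).to(device)
         with torch.no_grad():
             emb = model.encode_text(tokens)
         return emb / emb.norm(dim=-1, keepdim=True)      # (1, 512), unit-normalized

     def condition_from_image(path):
         image = preprocess(Image.open(path)).unsqueeze(0).to(device)
         with torch.no_grad():
             emb = model.encode_image(image)
         return emb / emb.norm(dim=-1, keepdim=True)      # same space as text embeddings

     # Either vector can be fed to the motion generator as its condition,
     # e.g. cond = condition_from_word("apple") or condition_from_image("apple.jpg").
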
  8. Richards, Blake A (Ed.)
    While current deep learning algorithms have been successful for a wide variety of artificial intelligence (AI) tasks, including those involving structured image data, they present deep neurophysiological conceptual issues due to their reliance on the gradients computed by backpropagation of errors (backprop). Gradients are required to obtain synaptic weight adjustments, but they require knowledge of feed-forward activities in order to conduct backward propagation, a biologically implausible process known as the "weight transport problem". Therefore, in this work, we present a more biologically plausible approach to solving the weight transport problem for image data. This approach, which we name the error-kernel driven activation alignment (EKDAA) algorithm, accomplishes this through the introduction of locally derived error transmission kernels and error maps. Like standard deep learning networks, EKDAA performs the standard forward process via weights and activation functions; however, its backward error computation involves adaptive error kernels that propagate local error signals through the network. The efficacy of EKDAA is demonstrated by performing visual-recognition tasks on the Fashion MNIST, CIFAR-10, and SVHN benchmarks, along with demonstrating its ability to extract visual features from natural color images. Furthermore, to demonstrate its non-reliance on gradient computations, results are presented for an EKDAA-trained CNN that employs a non-differentiable activation function.
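     A loose sketch of the core idea named in the abstract, contrasting backprop's transposed-weight error propagation with propagation through a separate, locally stored error kernel. The shapes, initialization, and activation choice are simplified assumptions rather than the published EKDAA algorithm.

     # Error-kernel propagation sketch: the backward error map comes from an
     # independent error kernel, not the transpose of the forward weights.
     import torch
     import torch.nn.functional as F

     B, C_in, C_out, H, W, k = 8, 3, 16, 32, 32, 3
     forward_kernel = torch.randn(C_out, C_in, k, k) * 0.05
     error_kernel   = torch.randn(C_in, C_out, k, k) * 0.05   # independent of forward_kernel

     x = torch.randn(B, C_in, H, W)
     pre = F.conv2d(x, forward_kernel, padding=1)              # forward pass (stride 1)
     act = torch.tanh(pre)

     # Suppose err_out is the error map arriving from the layer above (same shape as act).
     err_out = torch.randn_like(act)

     # Backprop would use the transposed forward weights:
     err_in_backprop = F.conv_transpose2d(err_out, forward_kernel, padding=1)

     # Error-kernel-style propagation instead convolves with the separate error kernel,
     # so no knowledge of forward_kernel is needed on the backward path.
     err_in_ekdaa = F.conv2d(err_out * (1 - act ** 2), error_kernel, padding=1)

     print(err_in_backprop.shape, err_in_ekdaa.shape)          # both (8, 3, 32, 32)
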
  9. Meta-analyses have not shown emotions to be significant predictors of deception. Critics of this conclusion argued that individuals must be engaged with each other in higher-stakes situations for such emotions to manifest, and that these emotions must be evaluated in their verbal context (Frank and Svetieva in J Appl Res Memory Cognit 1:131–133, 10.1016/j.jarmac.2012.04.006, 2012). This study examined behavioral synchrony as a marker of engagement in higher-stakes truthful and deceptive interactions, and then compared the differences in facial expressions of fear, contempt, disgust, anger, and sadness that were not consistent with the verbal content. Forty-eight pairs of participants were randomly assigned to interviewer and interviewee roles. Each interviewee was assigned to steal either a watch or a ring, to lie about the item they stole, and to tell the truth about the other item, under higher-stakes conditions of up to $30 in rewards for successful deception and $0 plus having to write a 15-min essay for unsuccessful deception. The interviews were coded for expression of emotions using EMFACS (Friesen and Ekman in EMFACS-7: emotional facial action coding system, 1984). Synchrony was demonstrated by the pairs of participants expressing overlapping instances of happiness (AU6 + 12). A 3 (low, moderate, high synchrony) × 2 (truth, lie) mixed-design ANOVA found that negative facial expressions of emotion were a significant predictor of deception, but only when they were not consistent with the verbal content, and only in the moderate- and high-synchrony conditions. This finding is consistent with data and theorizing showing that with higher stakes, or with higher engagement, emotions can be a predictor of deception.
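     For readers who want to see the analysis design in code form, here is an illustrative sketch of the 3 (synchrony) x 2 (veracity) mixed-design ANOVA using the pingouin package (assumed available). The column names and toy values are placeholders, with the dependent variable standing in for the rate of verbal-content-inconsistent negative expressions per interview.

     # Mixed-design ANOVA sketch (between: synchrony; within: veracity).
     # Data frame layout and values are illustrative only.
     import pandas as pd
     import pingouin as pg

     # Long-format data: one row per interviewee per condition (truth / lie).
     df = pd.DataFrame({
         "subject":   [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
         "synchrony": ["low"] * 4 + ["moderate"] * 4 + ["high"] * 4,
         "veracity":  ["truth", "lie"] * 6,
         "neg_expr":  [0.1, 0.2, 0.0, 0.3, 0.2, 0.8, 0.1, 0.9, 0.0, 1.2, 0.3, 1.4],
     })

     aov = pg.mixed_anova(data=df, dv="neg_expr",
                          within="veracity", subject="subject",
                          between="synchrony")
     print(aov)
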
  10. As many as three million school-age children between the ages of 5 and 14 years live with severe to profound hearing loss in Nigeria. Many of these Deaf or Hard of Hearing (DHH) children developed their hearing loss later in life, non-congenitally, so their parents are hearing. While the teachers in the Deaf schools they attend can often communicate effectively with them in dialects of American Sign Language (ASL), the unofficial sign lingua franca in Nigeria, communication at home with other family members is challenging and sometimes non-existent. This results in adverse social consequences for the students, including stigmatization. With the recent successes of AI in natural language understanding, the goal of automated sign language understanding is becoming more realistic with neural deep learning technologies. To this end, the proposed project aims to co-design and develop an AI-driven two-way sign language interpretation tool that can be deployed in homes to improve language accessibility and communication between DHH students and other family members. This ensures inclusive and equitable social interactions and can promote lifelong learning opportunities for these students outside of the school environment.